1 Climate change and temperature anomalies

We can find data on the Combined Land-Surface Air and Sea-Surface Water Temperature Anomalies in the Northern Hemisphere at NASA’s Goddard Institute for Space Studies. The tabular data of temperature anomalies can be read directly from the CSV file below.

weather <- 
  read_csv("https://data.giss.nasa.gov/gistemp/tabledata_v4/NH.Ts+dSST.csv", 
           skip = 1, 
           na = "***")
#inspecting data.frame 'weather'

#glimpse(weather) 

# selecting the year and 12 month variables
weather_clean <- weather %>% 
  select(1:13) 

# converting the dataframe to longer format (tidyweather)
tidyweather <- pivot_longer(weather_clean, cols = 2:13, names_to = "Month", values_to = "delta")

tidyweather
## # A tibble: 1,716 × 3
##     Year Month delta
##    <dbl> <chr> <dbl>
##  1  1880 Jan   -0.36
##  2  1880 Feb   -0.5 
##  3  1880 Mar   -0.23
##  4  1880 Apr   -0.29
##  5  1880 May   -0.06
##  6  1880 Jun   -0.16
##  7  1880 Jul   -0.17
##  8  1880 Aug   -0.25
##  9  1880 Sep   -0.22
## 10  1880 Oct   -0.3 
## # … with 1,706 more rows

1.1 Plotting Information

We plot the data using a time-series scatter plot, and add a trendline. To do that, we first need to create a new variable called date in order to ensure that the delta values are plotted chronologically.

tidyweather <- tidyweather %>%
  mutate(date = ymd(paste(as.character(Year), Month, "1")),
         month = month(date, label=TRUE),
         year = year(date))

ggplot(tidyweather, aes(x=date, y = delta))+
  geom_point()+
  geom_smooth(color="red") +
  theme_bw() +
  labs (
    title = "Weather Anomalies"
  )

1.2 Is the effect of increasing temperature more pronounced in some months?

Use facet_wrap() to produce a separate scatter plot for each month, again with a smoothing line.

ggplot(tidyweather) +
  aes(x = date, y = delta) +
  geom_point () +
  geom_smooth (color = "red") +
  theme_bw() +
  labs (title = "Weather Changes by Month") +
  facet_wrap(~ month)

- Delta, on the y-axis, represents the temperature deviation from the expected value. The red smooth lines are trend lines for the temperature changes: the steeper the line, the faster the rate of temperature increase.

  • Before 1950, the effect of increasing temperature was visually more pronounced in January and February. For instance, the smooth line for January is much steeper than those for July and August.

  • The year 1950 marks a rough turning point: after it, the smooth lines become much steeper for all months than before. Some months still show a more pronounced warming effect, but the differences are not visually distinctive.
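The visual impression of “steepness” can be checked numerically by fitting a per-month linear trend of delta on year. Below is a minimal sketch on toy data (the slopes are made up for illustration); the same group_by()/lm() pattern applies directly to tidyweather:

```r
library(dplyr)

# Toy data standing in for tidyweather: two months with different warming rates
toy <- tibble::tibble(
  year  = rep(1950:1999, times = 2),
  month = rep(c("Jan", "Jul"), each = 50),
  delta = c(0.02 * (1950:1999 - 1950),   # Jan warms at 0.02 per year
            0.01 * (1950:1999 - 1950))   # Jul warms at 0.01 per year
)

# Slope of delta on year, estimated separately for each month
slopes <- toy %>%
  group_by(month) %>%
  summarise(slope = coef(lm(delta ~ year))[["year"]])
```

A steeper smooth line corresponds to a larger estimated slope, so comparing these coefficients makes the month-by-month ranking precise.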

1.3 Removing Data before 1881

It is sometimes useful to group data into different time periods when studying historical data. For example, we often refer to decades such as the 1970s, 1980s, and 1990s. NASA calculates a temperature anomaly as the difference from the base period of 1951-1980. The code below creates a new data frame called comparison that groups the data into five time periods: 1881-1920, 1921-1950, 1951-1980, 1981-2010 and 2011-present.

We remove data before 1881 using filter(). Then, we use the mutate() function to create a new variable interval which records which period each observation belongs to. We can assign the different periods using case_when().

comparison <- tidyweather %>% 
  filter(Year>= 1881) %>%     #remove years prior to 1881
  #create new variable 'interval', and assign values based on criteria below:
  mutate(interval = case_when(
    Year %in% c(1881:1920) ~ "1881-1920",
    Year %in% c(1921:1950) ~ "1921-1950",
    Year %in% c(1951:1980) ~ "1951-1980",
    Year %in% c(1981:2010) ~ "1981-2010",
    TRUE ~ "2011-present"
  ))

Now that we have the interval variable, we can create a density plot to study the distribution of monthly deviations (delta), grouped by the different time periods we are interested in. Set fill to interval to group and colour the data by different time periods.

ggplot(comparison) +
  aes(x = delta, fill = interval) +
  geom_density(alpha = 0.3) +
  theme_minimal(base_size = 18)

So far, we have been working with monthly anomalies. However, we might be interested in average annual anomalies. We can do this by using group_by() and summarise(), followed by a scatter plot to display the result.

#creating yearly averages
average_annual_anomaly <- tidyweather %>% 
  group_by(Year) %>%   #grouping data by Year
  # creating summaries for mean delta 
  # use `na.rm=TRUE` to eliminate NA (not available) values 
  summarise(mean_delta = mean(delta, na.rm = TRUE)) 

#plotting the data:
ggplot(average_annual_anomaly) +
  aes(x = Year, y = mean_delta) +
  geom_point() +
  #fit the best-fit line, using the LOESS method
  geom_smooth(method = loess, color = "red") +
  #change theme to theme_bw() to have white background + black frame around plot
  theme_bw(base_size = 18) +
  labs(title = "Average Annual Anomalies Over Time",
       x = "Year",
       y = "Average Annual Temperature Change") 

1.4 Confidence Interval for delta

1.4.1 Using the formula

# choose the interval 2011-present using filter()
formula_ci <- comparison %>% 
  filter(interval == "2011-present") %>% 
  # calculate summary statistics for temperature deviation (delta):
  # mean, SD, count, SE, t critical value, lower/upper 95% CI
  summarise(mean = mean(delta, na.rm = TRUE),
            count = n(),
            SD = sd(delta, na.rm = TRUE),
            SE = SD / sqrt(count),
            # find t critical value with n-1 degrees of freedom
            t_critical = qt(0.975, count - 1),
            margin_of_error = SE * t_critical,
            lower_95pct_CI = mean - margin_of_error,
            upper_95pct_CI = mean + margin_of_error)
  
# print out formula_CI
formula_ci
## # A tibble: 1 × 8
##    mean count    SD     SE t_critical margin_of_error lower_95pct_CI upper_95p…¹
##   <dbl> <int> <dbl>  <dbl>      <dbl>           <dbl>          <dbl>       <dbl>
## 1  1.07   144 0.266 0.0222       1.98          0.0438           1.02        1.11
## # … with abbreviated variable name ¹​upper_95pct_CI
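The manual formula can be cross-checked against base R’s t.test(), which computes the same t-based interval. A sketch on simulated data (the values are toy stand-ins for the 2011-present deltas):

```r
set.seed(42)
x <- rnorm(144, mean = 1.07, sd = 0.27)  # toy stand-in for the deltas

# Manual formula, mirroring formula_ci above
m  <- mean(x)
se <- sd(x) / sqrt(length(x))
tc <- qt(0.975, length(x) - 1)
manual_ci <- c(m - tc * se, m + tc * se)

# Base R equivalent
ttest_ci <- as.numeric(t.test(x, conf.level = 0.95)$conf.int)
```

The two intervals agree to numerical precision, which is a quick sanity check on the hand-rolled SE and t-critical steps.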

1.4.2 Using the infer package

# calculate ci using the infer package (bootstrap simulation)
set.seed(1234)
boot_ci <- comparison %>% 
  # select the interval of interest
  filter(interval == "2011-present") %>% 
  # specify the variable of interest
  specify (response = delta) %>% 
  # generate bootstrap samples 
  generate (reps = 1000, type = "bootstrap") %>% 
  # find mean
  calculate(stat = "mean") %>% 
  # find the ci
  get_confidence_interval(level = 0.95, type = "percentile")

# print out boot_ci
boot_ci
## # A tibble: 1 × 2
##   lower_ci upper_ci
##      <dbl>    <dbl>
## 1     1.02     1.11
  • We are 95% confident that the average temperature change in the 2011-present time period is between 1.02 and 1.11 degrees Celsius.
  • The bootstrap method and the formula method give very similar confidence intervals.
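The percentile bootstrap can also be written in a few lines of base R, which makes explicit what infer does under the hood: resample with replacement, record the mean of each resample, and read off the 2.5th and 97.5th percentiles. A sketch on toy data:

```r
set.seed(1234)
x <- rnorm(144, mean = 1.07, sd = 0.27)  # toy stand-in for the deltas

# 1000 bootstrap resamples, recording the mean of each
boot_means <- replicate(1000, mean(sample(x, replace = TRUE)))

# Percentile CI: the central 95% of the bootstrap distribution
boot_ci_base <- quantile(boot_means, probs = c(0.025, 0.975))
```

This corresponds to generate(reps = 1000, type = "bootstrap"), calculate(stat = "mean"), and get_confidence_interval(type = "percentile") in the infer pipeline above.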

2 Biden’s Approval Margins

approval_polllist <- read.csv('https://projects.fivethirtyeight.com/biden-approval-data/approval_polllist.csv') 

approval_polllist <- approval_polllist %>% 
  mutate(enddate = mdy(enddate), # using lubridate to parse the month-day-year dates
         year = year(enddate),
         month = month(enddate),
         week = isoweek(enddate))

approval_polllist <- approval_polllist %>%
  filter(year == 2022) %>% # keep only 2022 polls
  mutate(app_disapp = approve - disapprove) %>% # net rating for each poll
  group_by(week, subgroup) %>% # grouping by week then subgroup, so that we can continue the analysis
  summarise(mean_approve_disapprove = mean(app_disapp), # mean of net ratings
            sd_rating = sd(app_disapp), # standard deviation
            count = n(), # number of polls in each week
            t_critical = qt(0.975, count - 1), # t distribution; the bigger the sample count, the closer t_critical gets to 1.96
            se_approve_disapp = sd_rating / sqrt(count), # standard error
            margin_of_error = t_critical * se_approve_disapp,
            rating_low = mean_approve_disapprove - margin_of_error, # make our ribbon bands
            rating_high = mean_approve_disapprove + margin_of_error
           )

ggplot(approval_polllist, aes(x = week, y = mean_approve_disapprove)) +
  geom_ribbon(aes(ymax = rating_high, ymin = rating_low, fill = subgroup)) +
  geom_line() +
  facet_grid(rows = vars(subgroup)) +
  labs(x = 'Week in 2022', y = '',
       title = "Biden's Net Approval Ratings in 2022",
       subtitle = "Weekly Data, Approve - Disapprove, %") +
  ylim(-25, -5) +
  xlim(0, 35)

For Biden’s net approval ratings, we can see similar trends across the three subgroups over the weeks of 2022.

3 Challenge 1: Excess rentals in TfL bike sharing

url <- "https://data.london.gov.uk/download/number-bicycle-hires/ac29363e-e0cb-47cc-a97a-e216d900a6b0/tfl-daily-cycle-hires.xlsx"

# Download TFL data to temporary file
httr::GET(url, write_disk(bike.temp <- tempfile(fileext = ".xlsx")))
## Response [https://airdrive-secure.s3-eu-west-1.amazonaws.com/london/dataset/number-bicycle-hires/2022-09-06T12%3A41%3A48/tfl-daily-cycle-hires.xlsx?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAJJDIMAIVZJDICKHA%2F20220911%2Feu-west-1%2Fs3%2Faws4_request&X-Amz-Date=20220911T204946Z&X-Amz-Expires=300&X-Amz-Signature=193aa70b7c43ddb9942e4610910d11499b295eac7b9c15f1f49d96c664168370&X-Amz-SignedHeaders=host]
##   Date: 2022-09-11 20:51
##   Status: 200
##   Content-Type: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
##   Size: 180 kB
## <ON DISK>  /var/folders/7x/psgggfkx7638wxs95nnmhr_00000gn/T//Rtmpzl6nt9/file173e8792507ec.xlsx
# Use read_excel to read it as dataframe
bike0 <- read_excel(bike.temp,
                   sheet = "Data",
                   range = cell_cols("A:B"))

# change dates to get year, month, and week
bike <- bike0 %>% 
  clean_names() %>% 
  rename (bikes_hired = number_of_bicycle_hires) %>% 
  mutate (year = year(day),
          month = lubridate::month(day, label = TRUE),
          week = isoweek(day))

3.1 Monthly Changes in TfL bike rentals

3.2 Weekly Changes in TfL bike rentals

This second plot looks at percentage changes from the expected level of weekly rentals. The two grey shaded rectangles correspond to Q2 (weeks 14-26) and Q4 (weeks 40-52).
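The “expected level” computation can be sketched as follows. The numbers and the choice of baseline years are made up for illustration, but the same filter/group_by/left_join pattern applies to the bike dataframe built above:

```r
library(dplyr)

# Toy weekly rentals standing in for the TfL data
toy_bike <- tibble::tibble(
  year        = rep(2016:2019, each = 3),
  week        = rep(1:3, times = 4),
  bikes_hired = c(100, 200, 300, 110, 210, 310, 90, 190, 290, 120, 240, 360)
)

# Expected rentals: mean hires for each week across a baseline period
expected <- toy_bike %>%
  filter(year <= 2018) %>%           # baseline years (an assumption here)
  group_by(week) %>%
  summarise(expected_hired = mean(bikes_hired))

# Percentage change from the expected level, week by week
excess <- toy_bike %>%
  filter(year == 2019) %>%
  left_join(expected, by = "week") %>%
  mutate(pct_change = (bikes_hired - expected_hired) / expected_hired * 100)
```

The pct_change column is what the ribbon plot would display, with positive values as excess rentals and negative values as shortfalls.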

3.2.1 Should you use the mean or the median to calculate your expected rentals? Why?

We use the mean to calculate the expected rentals: TfL is likely to use the data for profit and revenue analysis, and the mean, unlike the median, scales directly to total rentals and therefore revenue.

4 Challenge 2: Share of renewable energy production in the world

The National Bureau of Economic Research (NBER) has a very interesting dataset on the adoption of about 200 technologies in more than 150 countries since 1800. This is the Cross-country Historical Adoption of Technology (CHAT) dataset.

The following is a description of the variables

variable   class       description
variable   character   Variable name
label      character   Label for variable
iso3c      character   Country code
year       double      Year
group      character   Group (consumption/production)
category   character   Category
value      double      Value (related to label)

technology <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-07-19/technology.csv')

#get all technologies
labels <- technology %>% 
  distinct(variable, label)

# Get country names using 'countrycode' package
technology <- technology %>% 
  filter(iso3c != "XCD") %>% 
  mutate(iso3c = recode(iso3c, "ROM" = "ROU"),
         country = countrycode(iso3c, origin = "iso3c", destination = "country.name"),
         country = case_when(
           iso3c == "ANT" ~ "Netherlands Antilles",
           iso3c == "CSK" ~ "Czechoslovakia",
           iso3c == "XKX" ~ "Kosovo",
           TRUE           ~ country))

#make smaller dataframe on energy
energy <- technology %>% 
  filter(category == "Energy")

# download CO2 per capita from World Bank using {wbstats} package
# https://data.worldbank.org/indicator/EN.ATM.CO2E.PC
co2_percap <- wb_data(country = "countries_only", 
                      indicator = "EN.ATM.CO2E.PC", 
                      start_date = 1970, 
                      end_date = 2022,
                      return_wide=FALSE) %>% 
  filter(!is.na(value)) %>% 
  #drop unwanted variables
  select(-c(unit, obs_status, footnote, last_updated))

# get a list of countries and their characteristics
# we just want to get the region a country is in and its income level
countries <-  wb_cachelist$countries %>% 
  select(iso3c,region,income_level)

This is a very rich data set, not just for energy and CO2 data, but for many other technologies. In our case, we just need to produce a couple of graphs; at this stage, the emphasis is on data manipulation rather than making the graphs gorgeous.

First, produce a graph with the countries with the highest and lowest % contribution of renewables in energy production. This is made up of elec_hydro, elec_solar, elec_wind, and elec_renew_other. You may want to use the patchwork package to assemble the two charts next to each other.

library(patchwork)

#filter the energy dataset for 2019 and reshape to one row per country
en_new <- energy %>%
  filter(year == 2019) %>% 
  group_by(country, variable) %>% 
  summarise(count = sum(value)) %>% 
  ungroup() %>% 
  pivot_wider(names_from = "variable", values_from = "count") %>% 
  mutate(renew_en = elec_hydro + elec_solar + elec_wind + elec_renew_other) #total renewable energy

en_new[is.na(en_new)] <- 0 #replace missing values with zero

#express renewable energy as a percentage of total electricity production
en_new <- en_new %>% 
  mutate(percent = renew_en / elecprod * 100) %>% 
  arrange(desc(percent)) %>% 
  filter(renew_en != 0, percent != Inf)
en_new
## # A tibble: 196 × 13
##    country       elec_…¹ elec_…² elec_…³ elec_…⁴ elec_…⁵ elec_…⁶ elec_…⁷ elec_…⁸
##    <chr>           <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
##  1 Albania             0 0         5.15        0 0        0        0.022 0      
##  2 Nepal               0 0         5.33        0 0        0        0.08  0.014  
##  3 Lesotho             0 0         0.541       0 0        0        0.001 0      
##  4 Bhutan              0 0         6.61        0 4.67e-5  0        0.001 0.001  
##  5 Paraguay            0 0        48.9         0 1.88e-3  0.886    0     0      
##  6 Iceland             0 0        13.3         0 2.82e-3  5.93     0     0.00642
##  7 Congo - Kins…       0 0.00246  11.0         0 1.30e-3  0.0290   0.01  0      
##  8 Ethiopia            0 0        14.0         0 5.41e-3  0.0294   0.02  0.533  
##  9 Central Afri…       0 0         0.15        0 1   e-3  0        0     0      
## 10 Costa Rica          0 0         7.75        0 9.02e-2  1.68     0.076 1.80   
## # … with 186 more rows, 4 more variables: elecprod <dbl>, elec_cons <dbl>,
## #   renew_en <dbl>, percent <dbl>, and abbreviated variable names ¹​elec_coal,
## #   ²​elec_gas, ³​elec_hydro, ⁴​elec_nuc, ⁵​elec_oil, ⁶​elec_renew_other,
## #   ⁷​elec_solar, ⁸​elec_wind
#create plot 1 for the countries with the highest % of renewable energy
c2p1 <- ggplot(en_new %>% slice_max(order_by = percent, n = 20),
               aes(x = percent, y = fct_reorder(country, percent))) +
  geom_col() + 
  labs(title = "Highest % of renewables", x = NULL, y = NULL)
c2p1

#create plot 2 for the countries with the lowest % of renewable energy
c2p2 <- en_new %>%
  slice_min(order_by = percent, n = 20) %>% 
  ggplot(aes(x = percent, y = fct_reorder(country, percent))) +
  geom_col() +
  labs(title = "Lowest % of renewables", x = NULL, y = NULL)
c2p2

#combining plots
c2p1 + c2p2

Second, you can produce an animation to explore the relationship between CO2 per capita emissions and the deployment of renewables. As the % of energy generated by renewables goes up, do CO2 per capita emissions seem to go down?

Creating this animation is actually straightforward. You manipulate your data, then create the graph in the normal ggplot way. The only gganimate layers you need to add to your graph are

  labs(title = 'Year: {frame_time}', 
       x = '% renewables', 
       y = 'CO2 per cap') +
  transition_time(year) +
  ease_aes('linear')
library(gifski)
library(png)
library(gapminder)
library(gganimate)
library(av)
library(tibble)
library(rmarkdown)

#merge all the data into one dataset
new_co2_percap <- merge(co2_percap, countries, by = "iso3c")
new_co2_percap <- merge(new_co2_percap, en_new, by = "country")

#select the columns needed for the plot
data <- new_co2_percap[, c(1, 2, 6, 7, 9, 21)]

#plot and animate the graph
ggplot(data, aes(x = percent, y = value, color = income_level)) +
  geom_point() +
  facet_wrap(~income_level, nrow = 2) +
  labs(title = 'Year: {round(frame_time, 0)}',
       x = '% of renewable resources',
       y = 'CO2 per capita') +
  transition_time(date) +
  ease_aes('linear') +
  theme(legend.position = "none")